Correspondence and Component Analysis

Authors

  • JAN DE LEEUW
  • GEORGE MICHAILIDES
  • DEBORAH Y. WANG
Abstract

This paper is a non-standard introduction to (multiple) correspondence analysis and nonlinear principal component analysis. We start with a brief introduction to the classical (geometrical) motivation for the technique. We then restart from scratch, using the problem of maximizing an aspect of a multivariable or its correlation matrix. Some general algorithmic considerations are briefly discussed, and then we specialize our criteria in such a way that they define principal component analysis. Next, we introduce two different ways of defining multidimensionality. Various properties of the techniques derived in this way are studied.

1. Correspondence Analysis

Correspondence Analysis can be introduced in many different ways, which is probably the reason why it was reinvented many times over the years. We do not repeat the various derivations in this paper; instead, we refer to the extensive discussion in the books by Greenacre [1984], Gifi [1990], and Benzécri [1992].

Usually, correspondence analysis is motivated in graphical language. It is often said, in this context, that “A picture is worth a thousand numbers.” Complicated multivariate data are made more accessible by displaying the main regularities of the data in scatterplots. This graphical approach is outlined in considerable detail in the books mentioned above, and in the review articles by Hoffman and de Leeuw [1992] or Michailides and de Leeuw [1996]. We merely give a brief introduction, which differs in some important aspects from earlier ones, because it emphasizes the graph plot and the star plots (defined below). We think these plots nicely capture the essential geometric characteristics of the technique.

We have to choose one of the many names the technique has had over the years (see de Leeuw [1973, 1983] for an historical overview). The most widely used name seems to be (Multiple) Correspondence Analysis or MCA, and this is what we shall use as well.

Date: December 15, 1997.
1.1. Data. MCA starts with n observations on m categorical variables, where variable j has k_j categories (possible values). Using categorical variables causes no real loss of generality: so-called continuous variables are merely categorical variables with a large number of numerical categories. We use K for the total number of categories over all variables. The data are coded as m indicator matrices or dummies G_j, where G_j is a binary n × k_j matrix with exactly one non-zero element in each row i (indicating in which category of variable j observation i falls). The n × K matrix G = (G_1 | ... | G_m) is called the indicator supermatrix.

1.2. Graph Drawing. One can represent all information in the data by a bipartite graph with n + K vertices and nm edges. Each edge connects an object and a category. Thus the n vertices corresponding with the objects all have degree m, while the K vertices corresponding with the categories have varying degrees, equal to the number of objects in the category. The indicator supermatrix G is the adjacency matrix of the graph.

We can make a drawing of the graph by placing the vertices at n + K locations in the plane or, more generally, in R^p. If we then draw the nm edges, the resulting picture will generally be more informative and more aesthetically pleasing if the edges are short, that is, if objects are close to the categories they fall in, and categories are close to the objects falling in them. Thus we want to make a graph plot that “minimizes the amount of ink”, i.e. the total length of all edges.

There is a substantial literature in computer science about methods and criteria to draw graphs [di Battista et al., 1994]. Graph drawing algorithms for bipartite graphs that emphasize minimizing edge crossing are discussed in Eades and Wormald [1994]. MCA, i.e. our “minimum ink” criterion, is closely related to the force-directed or spring algorithms first introduced by Eades [1984].
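The coding in 1.1 and the vertex degrees in 1.2 can be checked on a small example. The following is a minimal sketch, not the authors' code: the toy data, the helper name `indicator_matrix`, and the use of zero-based integer category codes are all assumptions made for illustration. In graph terms, G here is the object-by-category block of the full (n + K) × (n + K) adjacency matrix.

```python
import numpy as np

def indicator_matrix(codes, k):
    """Return the n x k binary indicator matrix for integer codes in 0..k-1.

    Hypothetical helper: each row has exactly one 1, in the column of the
    category the observation falls in.
    """
    codes = np.asarray(codes)
    G_j = np.zeros((codes.size, k), dtype=int)
    G_j[np.arange(codes.size), codes] = 1
    return G_j

# Assumed toy data: n = 5 objects, m = 2 variables with k_1 = 3 and k_2 = 2.
data = np.array([[0, 1],
                 [1, 0],
                 [2, 1],
                 [0, 0],
                 [1, 1]])
k = [3, 2]

G_list = [indicator_matrix(data[:, j], k[j]) for j in range(data.shape[1])]
G = np.hstack(G_list)              # indicator supermatrix, n x K

n, K = G.shape
m = len(G_list)
object_degrees = G.sum(axis=1)     # every object vertex has degree m
category_degrees = G.sum(axis=0)   # number of objects in each category
num_edges = int(G.sum())           # n * m edges in the bipartite graph
```

Row sums of G give the object degrees and column sums give the category degrees, so the counts stated in the text (n + K vertices, nm edges) can be read directly off the supermatrix.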
Many of the criteria discussed in the computer science literature lead to NP-complete problems, i.e. they are computationally infeasible even for fairly small problems. Our edge-length algorithm is designed to be practical even for very large bipartite graphs. Actually, we will minimize the total squared length of the edges. The reasons for choosing the square are the classical ones.

“Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or easier to apply, than that which we have used in this work; it consists of making the sum of squares of the errors a minimum. By this method, a kind of equilibrium is established among the errors which, since it prevents the extremes from dominating, is appropriate for revealing the state of the system which most nearly approximates the truth.” (Legendre [1805], quoted by Stigler [1986, p. 13].)

In order to implement a useful algorithm, we also need a normalization constraint. This is needed because we want distances between vertices that are connected to be small, but we do not require distances between vertices that are not connected to be large. Merely minimizing the amount of ink, without requiring a normalization, can be done by collapsing the drawing into a single point.

1.3. Least Squares Criterion. To formalize our “minimum ink” criterion in a convenient way, we use the indicator matrices G_j. If the n × p matrix X has the locations of the object vertices in R^p, and the k_j × p matrix Y_j has the locations of the k_j category vertices of variable j, then the squared length of the n edges for variable j is

    σ_j(X, Y_j) = SSQ(X − G_j Y_j),    (1)

where SSQ() is short for the sum of squares. The corresponding graph drawing, with n + k_j vertices and n edges, is known as the star plot for variable j. The graph plot is the union (overlay) of the m star plots. The squared edge length over all variables is
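The per-variable criterion in equation (1) can be evaluated numerically. The sketch below is illustrative only: the toy indicator matrix, the random coordinates, and the function names `ssq` and `star_loss` are assumptions, not part of the paper. It checks that SSQ(X − G_j Y_j) equals the sum of squared edge lengths in the star plot, and that collapsing every vertex to a single point drives the loss to zero, which is why a normalization constraint is needed.

```python
import numpy as np

def ssq(A):
    """Sum of squares of all entries of A."""
    return float(np.sum(np.asarray(A, dtype=float) ** 2))

def star_loss(X, G_j, Y_j):
    """sigma_j(X, Y_j) = SSQ(X - G_j Y_j), as in equation (1)."""
    return ssq(X - G_j @ Y_j)

# Assumed toy setup: n = 5 objects, one variable with k_j = 3 categories, p = 2.
G_1 = np.array([[1, 0, 0],
                [0, 1, 0],
                [0, 0, 1],
                [1, 0, 0],
                [0, 1, 0]])
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))     # object locations (arbitrary, for illustration)
Y_1 = rng.normal(size=(3, 2))   # category locations (arbitrary, for illustration)

loss_random = star_loss(X, G_1, Y_1)

# Same quantity computed edge by edge: object i is joined to its own category.
edge_sum = sum(
    float(np.sum((X[i] - Y_1[np.argmax(G_1[i])]) ** 2)) for i in range(5)
)

# Trivial solution without normalization: every vertex at the origin.
loss_collapsed = star_loss(np.zeros((5, 2)), G_1, np.zeros((3, 2)))
```

Because row i of G_j Y_j is simply the location of the category that object i falls in, the matrix expression and the edge-by-edge sum agree up to rounding.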





Publication date: 1997